
Conversation

elmiko
Contributor

@elmiko elmiko commented Sep 12, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #8494

Special notes for your reviewer:

this is a challenging scenario to debug; please see the related issue.

Does this PR introduce a user-facing change?

The ClusterAPI provider will not scale down a MachineDeployment that is undergoing an update.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area labels Sep 12, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elmiko
Once this PR has been reviewed and has the lgtm label, please assign feiskyer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler area/provider/alicloud Issues or PRs related to the AliCloud cloud provider implementation area/provider/aws Issues or PRs related to aws provider area/provider/azure Issues or PRs related to azure provider area/provider/cluster-api Issues or PRs related to Cluster API provider area/provider/coreweave and removed do-not-merge/needs-area labels Sep 12, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/provider/digitalocean Issues or PRs related to digitalocean provider area/provider/equinixmetal Issues or PRs related to the Equinix Metal cloud provider for Cluster Autoscaler area/provider/externalgrpc Issues or PRs related to the External gRPC provider area/provider/gce area/provider/hetzner Issues or PRs related to Hetzner provider area/provider/huaweicloud area/provider/ionoscloud area/provider/kwok Issues or PRs related to the kwok cloud provider for Cluster Autoscaler area/provider/linode Issues or PRs related to linode provider area/provider/magnum Issues or PRs related to the Magnum cloud provider for Cluster Autoscaler area/provider/oci Issues or PRs related to oci provider area/provider/rancher area/provider/utho Issues or PRs related to Utho provider labels Sep 12, 2025
@elmiko
Contributor Author

elmiko commented Sep 12, 2025

i'm still working on some clusterapi-specific unit tests, but they are quite challenging given the mocks that are needed.

@elmiko
Contributor Author

elmiko commented Sep 16, 2025

+1 to what @sbueringer is saying; also, this problem is currently confined to the clusterapi provider, but it could affect any provider that does node updating in a similar fashion to clusterapi. i think we need to make the autoscaler smarter in these scenarios, where a cloud provider needs to have more control over which nodes are being marked for removal during a maintenance window.

@elmiko elmiko force-pushed the add-good-candidate-node-interface branch from cae87c7 to 150cf78 Compare September 16, 2025 19:16
@elmiko
Contributor Author

elmiko commented Sep 16, 2025

updated with @sbueringer's suggestions.

@sbueringer
Member

Answered above

This function allows cloud providers to specify when a node is not a
good candidate for scaling down. This will occur before the autoscaler has
begun to cordon, drain, and taint any node for scale down.

Also adds a unit test for the prefiltering node processor.
The initial implementation of this function for clusterapi will return
that a node is not a good candidate for scale down when it belongs to a
MachineDeployment that is currently rolling out an upgrade.
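For context, here is a minimal sketch of the hook the commit message describes: a way for a provider to veto a node before the autoscaler begins to cordon, drain, and taint it. The interface name and exact placement are illustrative, not the PR's final code.

```go
package cloudprovider

import apiv1 "k8s.io/api/core/v1"

// ScaleDownCandidateChecker is an illustrative stand-in for the proposed
// addition to the cloud provider API (the PR itself extends the existing
// CloudProvider interface).
type ScaleDownCandidateChecker interface {
	// IsNodeCandidateForScaleDown returns false when the provider does not
	// want this node considered for scale down, e.g. because it belongs to a
	// MachineDeployment that is currently rolling out an upgrade.
	IsNodeCandidateForScaleDown(node *apiv1.Node) (bool, error)
}
```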
@elmiko elmiko force-pushed the add-good-candidate-node-interface branch from 150cf78 to a81913f Compare September 17, 2025 14:48
@elmiko
Contributor Author

elmiko commented Sep 17, 2025

updated again with the change i missed and the correction to the logic.

@sbueringer
Member

Prod code changes lgtm from my side

@aleksandra-malinowska
Contributor

aleksandra-malinowska commented Sep 17, 2025

No, this is about preventing cluster autoscaler from deleting / scaling down a Node altogether if Cluster API is doing a rollout.

Adding this annotation to a node always prevents CA from removing it. How is this insufficient?

The "if Cluster API is doing a rollout" part would have to be handled on the Cluster API side, of course. Since it's already cordoning and draining the node, it seems reasonable to assume it could add an annotation, too.

@sbueringer
Member

sbueringer commented Sep 17, 2025

The "if Cluster API is doing a rollout" part would have to be handled on the Cluster API side, of course. Since it's already cordoning and draining the node, it seems reasonable to assume it could add an annotation, too.

At the point where the issue occurs, Cluster API is not even aware that the autoscaler is cordoning/tainting/draining the Node

cluster-autoscaler will cordon/taint/drain the node and then eventually, at the end, in RemoveNodes tell Cluster API to delete the Node/Machine. Up until that point Cluster API has no idea what the autoscaler has been doing with the Node. Until then it assumes this Node/Machine is perfectly fine and usable.

@elmiko elmiko changed the title WIP Add good candidate node interface Add good candidate node interface Sep 17, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 17, 2025
@elmiko
Contributor Author

elmiko commented Sep 17, 2025

i think @sbueringer summed it up nicely: we need a way for the cloud provider to influence whether, and how, the autoscaler performs the steps prior to reducing the size of a node group.

i think it is also worth noting that for cloud providers that may create more nodes than their node group size indicates, we need an automatic way for the provider to tell the autoscaler that a node should be excluded from being considered for scale down. although adding the annotation might help in some scenarios, i think we need a mechanism the provider can exercise that does not require an update to a resource object or the api server.

@elmiko elmiko changed the title Add good candidate node interface Add IsNodeCandidateForScaleDown cloud provider interface Sep 17, 2025
@aleksandra-malinowska
Contributor

At the point where the issue occurs, Cluster API is not even aware that the autoscaler is cordoning/tainting/draining the Node

Thanks for clarifying that the draining referred to in #8494 is part of the scale-down process, not the upgrade process.

From the chart in #8494 I assumed Cluster API was also draining nodes before deleting the underlying machine as part of the upgrade (hence "CA observes (...) old nodes that have been cordoned/drained").

The problem is that if autoscaler tries to delete / scale down a Node during a rollout there's a high chance that it will end up deleting the wrong Node (and that then repeats until we have no Nodes anymore for a node group)

+1 to what @sbueringer is saying; also, this problem is currently confined to the clusterapi provider, but it could affect any provider that does node updating in a similar fashion to clusterapi. i think we need to make the autoscaler smarter in these scenarios, where a cloud provider needs to have more control over which nodes are being marked for removal during a maintenance window.

CA's job in a scale-down scenario is "choose which node will be deleted and make it happen". If something else overrides CA's choice and chooses which node will be removed, then CA can't do its job.

I don't think there's any way to make CA smarter as in "allow it to do its job even when an external API is actively preventing it". Quite possibly the best it can do is to stop trying (disable scale-down) when it believes an external API will override its decisions anyway.

This is exactly what this PR does, and to be clear, I don't object to this mitigation. My only objection is adding a new method, and one very specific to this scenario, to the CloudProvider interface. There are other ways to achieve the same effect, and I'd like to explore them before growing that interface.

If adding the annotation isn't possible because Cluster API doesn't drain nodes before deleting machines (does it?), then perhaps this can be handled at a node group level? Does a MachineDeployment or a MachineSet correspond to a NodeGroup object? If yes, the option to disable scale-down can probably be added to https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/config/autoscaling_options.go#L39.
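A rough sketch of that node-group-option direction follows, assuming a hypothetical ScaleDownDisabled per-group option (the linked options struct has no such field today, and the rollout helper below is invented):

```go
package clusterapi

import "time"

// nodeGroupOptions mirrors the shape of the autoscaler's per-node-group
// options with one extra, hypothetical field; nothing here is real code.
type nodeGroupOptions struct {
	ScaleDownUnneededTime time.Duration // existing-style per-group option
	ScaleDownDisabled     bool          // hypothetical: skip this whole group during scale down
}

// nodegroup stands in for the provider's node group type.
type nodegroup struct{}

// machineDeploymentIsRollingOut is a hypothetical rollout check.
func (ng *nodegroup) machineDeploymentIsRollingOut() (bool, error) { return false, nil }

// GetOptions shows where the provider could flip the flag: while the backing
// MachineDeployment is rolling out, scale down is disabled for the group.
func (ng *nodegroup) GetOptions(defaults nodeGroupOptions) (*nodeGroupOptions, error) {
	opts := defaults
	rollingOut, err := ng.machineDeploymentIsRollingOut()
	if err != nil {
		return nil, err
	}
	opts.ScaleDownDisabled = rollingOut
	return &opts, nil
}
```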

@sbueringer
Member

sbueringer commented Sep 18, 2025

If adding the annotation isn't possible because Cluster API doesn't drain nodes before deleting machines

CAPI does drain Nodes before it deletes Machines, but as mentioned above CAPI is not aware that cluster-autoscaler is starting to delete the Node. This works something like this:

  1. autoscaler gets scale down candidates
  2. autoscaler selects a Machine for scale down
  3. autoscaler: cordons, taints and drains the Node
  4. autoscaler calls RemoveNodes which means it triggers Machine deletion in CAPI (by scaling down the MD and marking a Machine so it gets prioritized for deletion)

Only after step 4 is CAPI aware that this Node is being deleted; before that, as far as CAPI is concerned, this Machine is a healthy Machine like every other.

The problem we are trying to address is that during step 4 the autoscaler makes assumptions about how it can delete a specific Machine ("by scaling down the MD and marking a Machine so it gets prioritized for deletion") that are not true during MachineDeployment (MD) rollouts. Because of that, we want to ensure that the autoscaler does not scale down a Machine while a MD is rolling out (including that it should not do steps 1-3).

The CAPI Machine deletion process looks like this (very short version): CAPI cordons and drains the Node, waits for attached volumes to detach, deletes the infrastructure machine, and finally deletes the Node object.

Another orthogonal problem, in our opinion, is that with CAPI the autoscaler should never do steps 1-3, as CAPI has its own logic for that.

But the current PR just tries to solve the issue during MD rollouts for now, as it can have catastrophic consequences (a MD being scaled down to 0 during a rollout).
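For illustration, this is the kind of rollout check that implies, assuming the MachineDeployment is read as unstructured (the field names follow the Cluster API MachineDeployment status; the function itself is a sketch, not this PR's code):

```go
package clusterapi

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// machineDeploymentIsRollingOut reports whether a MachineDeployment still has
// replicas that are not yet updated and available, i.e. a rollout in progress.
func machineDeploymentIsRollingOut(md *unstructured.Unstructured) (bool, error) {
	specReplicas, found, err := unstructured.NestedInt64(md.Object, "spec", "replicas")
	if err != nil || !found {
		return false, err
	}
	updated, _, err := unstructured.NestedInt64(md.Object, "status", "updatedReplicas")
	if err != nil {
		return false, err
	}
	available, _, err := unstructured.NestedInt64(md.Object, "status", "availableReplicas")
	if err != nil {
		return false, err
	}
	// During a rollout some replicas are still old, or not yet available.
	return updated < specReplicas || available < specReplicas, nil
}
```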

CA's job in a scale-down scenario is "choose which node will be deleted and make it happen". If something else overrides CA's choice and chooses which node will be removed, then CA can't do its job.

We are attempting to tell CA: "this MD is going through a rollout, you should not scale down any Node/Machine of a MD during rollout".

This is exactly what this PR does, and to be clear, I don't object to this mitigation. My only objection is adding a new method, and one very specific to this scenario, to the CloudProvider interface. There are other ways to achieve the same effect, and I'd like to explore them before growing that interface.

I personally don't have a strong opinion on how it's implemented; I would only like Nodes of a MD that is rolling out to be kept out of the scale down candidates, so we can avoid the scale down entirely (including steps 1-3). I think today we can only block in step 4, which is way too late, as the Node is already cordoned/tainted/drained at this point.

Does a MachineDeployment or a MachineSet correspond to a NodeGroup object?

For the current case a MachineDeployment corresponds to a NodeGroup object

@aleksandra-malinowska
Contributor

Thanks for the detailed explanation, much appreciated!

I personally don't have a strong opinion on how it's implemented; I would only like Nodes of a MD that is rolling out to be kept out of the scale down candidates, so we can avoid the scale down entirely (including steps 1-3). I think today we can only block in step 4, which is way too late, as the Node is already cordoned/tainted/drained at this point.

For the current case a MachineDeployment corresponds to a NodeGroup object

This sounds like a node group option will work then. It can be checked at the same point as in the current PR.

Another orthogonal problem in our opinion is that with CAPI the autoscaler should never do steps 1.-3. as CAPI has it's own logic for that.

Does Cluster API ensure that all pods from the node it selects in (2) will be able to run elsewhere in the cluster? If it does, then I guess it could work, although at this point CAPI may just as well do (4), too.

If not, doing (4) without (2) would risk some workloads becoming unschedulable before triggering a scale-up, essentially just causing an unnecessary disruption. I believe the only scenario where it's guaranteed to work well is removing 1 empty node at a time, which is pretty limited. But as you said, this discussion is unrelated to this PR.

@sbueringer
Member

sbueringer commented Sep 18, 2025

This sounds like a node group option will work then. It can be checked at the same point as in the current PR.

I think you are right. The difference is that this only allows disabling scale down for an entire node group, not for single nodes. But our current use case is also only to disable it for an entire node group.

But as you said, this discussion is unrelated to this PR.

Yup, let's discuss this separately once we get to that

@elmiko
Contributor Author

elmiko commented Sep 18, 2025

we discussed this issue/pr at the sig office hours today; @towca suggested looking at the custom scale down processors as another possible option. i had looked at the scale down processors initially, but didn't arrive at a solution. i am going to investigate that path again, as i tend to agree with @aleksandra-malinowska's point about not adding complexity to the cloud provider interface if we don't need it.

@wjunott
Contributor

wjunott commented Sep 19, 2025

we discussed this issue/pr at the sig office hours today; @towca suggested looking at the custom scale down processors as another possible option. i had looked at the scale down processors initially, but didn't arrive at a solution. i am going to investigate that path again, as i tend to agree with @aleksandra-malinowska's point about not adding complexity to the cloud provider interface if we don't need it.

Since the discussion has come to this point, are we still considering the approach of putting the logic into the capi provider's NodeGroupForNode() implementation? From what I can see, not all callers of this api handle a nil result, and we could enhance them if nil is documented to mean 'the node should not be processed by cluster autoscaler'. @towca @elmiko
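To make the suggestion concrete, here is a rough sketch of that idea in the provider's NodeGroupForNode (the rollout helper is hypothetical, and the replies below explain why this path was not taken):

```go
package clusterapi

import (
	apiv1 "k8s.io/api/core/v1"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

type provider struct{}

// machineDeploymentIsRollingOutFor is a hypothetical helper that would inspect
// the node's owning MachineDeployment.
func (p *provider) machineDeploymentIsRollingOutFor(node *apiv1.Node) (bool, error) {
	return false, nil
}

// NodeGroupForNode returning nil signals "do not process this node", but that
// affects every caller that looks up the node's group, not only scale down.
func (p *provider) NodeGroupForNode(node *apiv1.Node) (cloudprovider.NodeGroup, error) {
	rollingOut, err := p.machineDeploymentIsRollingOutFor(node)
	if err != nil {
		return nil, err
	}
	if rollingOut {
		return nil, nil
	}
	// ... the normal node group lookup would go here ...
	return nil, nil
}
```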

@sbueringer
Member

sbueringer commented Sep 19, 2025

Since the discussion has come to this point, are we still considering the approach of putting the logic into the capi provider's NodeGroupForNode() implementation? From what I can see, not all callers of this api handle a nil result, and we could enhance them if nil is documented to mean 'the node should not be processed by cluster autoscaler'. @towca @elmiko

I think this still applies: #8531 (comment)

Still seems like the riskiest option to me. I'm not sure I follow your suggestion. Just to highlight it again, we specifically want to disable only scale down during a MD rollout (not scale up).

@elmiko
Contributor Author

elmiko commented Sep 19, 2025

Since the discussion has come to this point, are we still considering the approach of putting the logic into the capi provider's NodeGroupForNode() implementation? From what I can see, not all callers of this api handle a nil result, and we could enhance them if nil is documented to mean 'the node should not be processed by cluster autoscaler'. @towca @elmiko

no, we can't use NodeGroupForNode in this case. there is extra context related to the scale down process that cluster-api needs to understand.

@elmiko
Contributor Author

elmiko commented Sep 23, 2025

i have been investigating an alternative implementation that would allow cloud providers to inject custom processors into the core autoscaler configuration. so far, it seems that the processors change would also require at least a refactor of the NewCloudProvider interface to allow passing the processors.

i think the idea of allowing cloud providers to add custom processors is actually really powerful. i will sketch that out as well and bring it up for discussion this week at the sig meeting.
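As a rough sketch of the shape such a provider-supplied processor could take (the interface and type names below are simplified stand-ins, not the autoscaler's actual processor interfaces):

```go
package processors

import apiv1 "k8s.io/api/core/v1"

// scaleDownCandidateFilter stands in for a processor hook a cloud provider
// could register to trim the candidate list before cordon/taint/drain begins.
type scaleDownCandidateFilter interface {
	FilterCandidates(nodes []*apiv1.Node) []*apiv1.Node
}

// rolloutAwareFilter drops nodes whose owning MachineDeployment is rolling
// out, as reported by a provider-supplied callback.
type rolloutAwareFilter struct {
	isRollingOut func(node *apiv1.Node) bool
}

func (f *rolloutAwareFilter) FilterCandidates(nodes []*apiv1.Node) []*apiv1.Node {
	kept := make([]*apiv1.Node, 0, len(nodes))
	for _, n := range nodes {
		if f.isRollingOut(n) {
			continue // never cordon/taint/drain nodes of a rolling MachineDeployment
		}
		kept = append(kept, n)
	}
	return kept
}
```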

@elmiko
Contributor Author

elmiko commented Sep 26, 2025

we discussed this PR at the sig meeting yesterday. i am going to proceed with creating the alternate patch; i think it will give a better end result in terms of giving providers more functionality.

i will test the alternate solution to ensure it continues to solve the problem.

@elmiko
Contributor Author

elmiko commented Sep 26, 2025

the refactor is looking good and does not add complexity to the cloud provider interface, although it does modify how cloud providers are constructed.

i will post a PR once i finish some local testing.

@elmiko
Contributor Author

elmiko commented Sep 29, 2025

i have created #8583 as an alternate solution. it appears to solve the same issues in my local test environment and has the advantage of not introducing a new cloud provider interface function.

i think i prefer the 8583 solution, but i will leave both PRs open for a few days to gather comments. if there are no objections, i will most likely close this PR by the end of the week.

@elmiko
Contributor Author

elmiko commented Oct 2, 2025

i am closing this in favor of #8583. i think the other PR is less complex and gives us more options in the future.

@elmiko elmiko closed this Oct 2, 2025